{ "cells": [ { "cell_type": "markdown", "id": "213d6cca", "metadata": {}, "source": [ "# Small Molecule Rediscovery #\n", "\n", "**In this notebook, we will show how to explore the rediscovery capabilities of GB-BI for small molecules.**\n", "\n", "Welcome to our notebook on using GB-GI to rediscover small molecules using the Tanimoto similarity and fingerprints! Because GB-BI is a quality-diversity method, the end result of such an optimisation is a collection of high-scoring molecules (i.e. highly similar to the target molecule) with a diversity of physico-chemical properties. So, if you're eager to learn how to use GB-GI's rediscovery capabilities or you want to know how to use different types of fingerprints, you're in the right place. For rediscovery based on descriptors and conformer aggregate similarity, see the dedicated tutorial." ] }, { "cell_type": "markdown", "id": "9861ed19-0ed3-40fd-b036-35d678844bcd", "metadata": {}, "source": [ "## Running GB-BI\n", "The most basic usage of GB-BI is to simply run the illuminate.py file, without any extra configurations specficied in the command line. In this case, GB-BI will call on the `config.yaml` file in ../Argenomic-GP/configuration folder by making use of Hydra (an open-source Python framework that handle a hierarchical configuration file which can be easily be adapted through the command line.)." ] }, { "cell_type": "code", "execution_count": null, "id": "d1a59a1e-6e9d-400c-bcfe-949843a42b44", "metadata": {}, "outputs": [], "source": [ "python GB-BI.py" ] }, { "cell_type": "markdown", "id": "4cfa2c99-a2b3-4a93-9efd-d1026f3cb828", "metadata": {}, "source": [ "While initially the configuration file seem complex or even intimidating, each of the individual components are very easy to understand and adapt. Later on in this tutorial, we will go into detail for each of the components, but for now we just want to highlight a few key aspects. " ] }, { "cell_type": "code", "execution_count": 4, "id": "856e6d07", "metadata": {}, "outputs": [], "source": [ "---\n", "controller:\n", " max_generations: 100\n", " max_fitness_calls: 1000\n", "archive:\n", " name: archive_150_4_25000\n", " size: 150\n", " accuracy: 25000\n", "descriptor:\n", " properties:\n", " - Descriptors.ExactMolWt\n", " - Descriptors.MolLogP\n", " - Descriptors.TPSA\n", " - Crippen.MolMR\n", " ranges:\n", " - - 225\n", " - 555\n", " - - -0.5\n", " - 5.5\n", " - - 0\n", " - 140\n", " - - 40\n", " - 130\n", "fitness:\n", " type: Fingerprint\n", " target: \"O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4\"\n", " representation: ECFP4\n", "arbiter:\n", " rules:\n", " - Glaxo\n", "generator:\n", " batch_size: 40\n", " initial_size: 40\n", " mutation_data: data/smarts/mutation_collection.tsv\n", " initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi\n", "surrogate:\n", " type: Fingerprint\n", " representation: ECFP4\n", "acquisition:\n", " type: Mean" ] }, { "cell_type": "markdown", "id": "42b75ed1-e897-44d4-8c46-83cf2d55dd60", "metadata": {}, "source": [ "This standard config file specifies GB_BI run for a maximum of 100 generations or 1000 fitness function calls with an archive containing 150 niches. This is also the maximum amount of molecules that could potentially make up the evolutionary population at any given generation. The archive is spanned by 4 descriptors (`Descriptors.ExactMolWt`, `Descriptors.MolLogP`, `Descriptors.TPSA`, `Crippen.MolMR`) an limited to corresponding ranges indicated in the configuration file. The fitness function is the Tanimoto similarity (standard for all fingerprint representations) applied to ECFP4 fingerprints. The target, Troglitazone, is specified by its SMILES representation. Finally, the molecular representation (`ECFP4 fingerprints`) for the surrogate model and the acquistion function (`Posterior Mean`) are also specified." ] }, { "cell_type": "markdown", "id": "bf86efde-f693-434d-ae72-787e948c079d", "metadata": {}, "source": [ "## Interpreting the Output of a GB-BI Optimisation Run ##" ] }, { "cell_type": "markdown", "id": "3ea4732a-c2e7-4d80-b2de-b44fe7029c39", "metadata": {}, "source": [ "In the same way we rely on HYDRA to take care of configurations files, we also use it to autmatically generate date and time-stamped output folders. For GB-BI, these folders contain a series of `archive_i.csv` files where `i` is the generation at the of which the archive was printed to file. " ] }, { "cell_type": "code", "execution_count": 25, "id": "d1640914-1658-43c2-b969-7436b6c872a3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
acquisition_valuefitnessniche_indexpredicted_fitnesspredicted_uncertaintysmiles
00.4250270.44898030.4250270.000435O=C1NC(=O)C(Cc2ccc(OCC=C(O)Br)cc2)S1
10.3657070.37735840.3657070.000271Cc1c(C)c2c(c(C)c1O)CCC(C)(CN1CCN(C)C1)O2
20.4238950.42727350.4238950.000272CN1CCN(C(O)COc2ccc(CC3SC(=O)NC3=O)cc2)CC1
30.5372050.55208360.5372050.000205Cc1c(C)c2c(c(C)c1O)CCC(C)(Cc1ccc(Br)cc1)O2
40.3610910.32631670.3610910.000935O=C1NC(=O)C(Cc2ccccc2)S1
50.4366280.44000080.4366280.000139CN(Br)N(Br)COc1ccc(CC2SC(=O)NC2=O)cc1
60.6497360.652632100.6497360.000634Cc1c(C)c2c(c(C)c1O)CCC(C)(CONCC1SC(=O)NC1=O)O2
70.2082110.215686110.2082110.000423O=C1NC(=O)C(COCC(O)OBr)S1
80.7136240.765957140.7136240.000880Cc1c(C)c2c(c(C)c1O)CCC(C)(c1ccc(CC3SC(=O)NC3=O...
90.8009241.000000180.8009240.000857Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3...
\n", "
" ], "text/plain": [ " acquisition_value fitness niche_index predicted_fitness \\\n", "0 0.425027 0.448980 3 0.425027 \n", "1 0.365707 0.377358 4 0.365707 \n", "2 0.423895 0.427273 5 0.423895 \n", "3 0.537205 0.552083 6 0.537205 \n", "4 0.361091 0.326316 7 0.361091 \n", "5 0.436628 0.440000 8 0.436628 \n", "6 0.649736 0.652632 10 0.649736 \n", "7 0.208211 0.215686 11 0.208211 \n", "8 0.713624 0.765957 14 0.713624 \n", "9 0.800924 1.000000 18 0.800924 \n", "\n", " predicted_uncertainty smiles \n", "0 0.000435 O=C1NC(=O)C(Cc2ccc(OCC=C(O)Br)cc2)S1 \n", "1 0.000271 Cc1c(C)c2c(c(C)c1O)CCC(C)(CN1CCN(C)C1)O2 \n", "2 0.000272 CN1CCN(C(O)COc2ccc(CC3SC(=O)NC3=O)cc2)CC1 \n", "3 0.000205 Cc1c(C)c2c(c(C)c1O)CCC(C)(Cc1ccc(Br)cc1)O2 \n", "4 0.000935 O=C1NC(=O)C(Cc2ccccc2)S1 \n", "5 0.000139 CN(Br)N(Br)COc1ccc(CC2SC(=O)NC2=O)cc1 \n", "6 0.000634 Cc1c(C)c2c(c(C)c1O)CCC(C)(CONCC1SC(=O)NC1=O)O2 \n", "7 0.000423 O=C1NC(=O)C(COCC(O)OBr)S1 \n", "8 0.000880 Cc1c(C)c2c(c(C)c1O)CCC(C)(c1ccc(CC3SC(=O)NC3=O... \n", "9 0.000857 Cc1c(C)c2c(c(C)c1O)CCC(C)(COc1ccc(CC3SC(=O)NC3... " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "archive_0 = pd.read_csv('./Example_Output/archive_100.csv')[['acquisition_value', 'fitness', 'niche_index',\t'predicted_fitness', 'predicted_uncertainty',\t'smiles']]\n", "archive_0.head(10)" ] }, { "cell_type": "markdown", "id": "5f7b207f-b94b-49de-bd8b-8bc8da108f62", "metadata": {}, "source": [ "In addition, the folder also contains a `molecules.csv` file and a `statistics.csv` file. These files respectively comprise of all molecules (and their properties) that have been evalauted by the fitness function and the overall statistics of the archive and the GB-BI run at each generation. " ] }, { "cell_type": "code", "execution_count": 24, "id": "af9c0263-390d-44fe-ac68-d98ade7a89ab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
generationmaximum fitnessmean fitnessquality diversity scorecoveragefunction callsmax_errmsemae
000.4188030.3776483.3988366.00000024NaNNaNNaN
110.4959350.3268909.80671320.000000540.1897960.0099390.078342
220.4959350.33760312.15372024.000000850.0768840.0013550.029830
330.4959350.36082414.43296926.6666671190.0603530.0007540.021815
440.5648150.37556015.77354128.0000001550.1064450.0014860.028396
550.5925930.38628817.38297230.0000001900.0894150.0014110.030500
660.7916670.39784219.09639832.0000002260.2259880.0027620.032465
770.8172040.40236620.52066234.0000002630.0939530.0008310.019793
880.8172040.41543321.60249834.6666673010.0919790.0007160.018666
991.0000000.42993622.35668134.6666673380.1990760.0017000.022910
\n", "
" ], "text/plain": [ " generation maximum fitness mean fitness quality diversity score \\\n", "0 0 0.418803 0.377648 3.398836 \n", "1 1 0.495935 0.326890 9.806713 \n", "2 2 0.495935 0.337603 12.153720 \n", "3 3 0.495935 0.360824 14.432969 \n", "4 4 0.564815 0.375560 15.773541 \n", "5 5 0.592593 0.386288 17.382972 \n", "6 6 0.791667 0.397842 19.096398 \n", "7 7 0.817204 0.402366 20.520662 \n", "8 8 0.817204 0.415433 21.602498 \n", "9 9 1.000000 0.429936 22.356681 \n", "\n", " coverage function calls max_err mse mae \n", "0 6.000000 24 NaN NaN NaN \n", "1 20.000000 54 0.189796 0.009939 0.078342 \n", "2 24.000000 85 0.076884 0.001355 0.029830 \n", "3 26.666667 119 0.060353 0.000754 0.021815 \n", "4 28.000000 155 0.106445 0.001486 0.028396 \n", "5 30.000000 190 0.089415 0.001411 0.030500 \n", "6 32.000000 226 0.225988 0.002762 0.032465 \n", "7 34.000000 263 0.093953 0.000831 0.019793 \n", "8 34.666667 301 0.091979 0.000716 0.018666 \n", "9 34.666667 338 0.199076 0.001700 0.022910 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "statistics = pd.read_csv('./Example_Output/statistics.csv')\n", "statistics.head(10)" ] }, { "cell_type": "markdown", "id": "97f34aff", "metadata": {}, "source": [ "## Overwriting the Configuration File from the Command Line ##\n", "\n", "All the settings specified in the standard configuration file can easily be overwritten in the command line. For example, we can change the target to `Tiotixene` and change the acquistion function to the `expected improvement` (EI). " ] }, { "cell_type": "code", "execution_count": null, "id": "1cc18d8a-4feb-4cba-8174-b01e57de91ac", "metadata": {}, "outputs": [], "source": [ "python GB-BI.py fitness.target=\"O=S(=O)(N(C)C)c2cc1C(\\c3c(Sc1cc2)cccc3)=C/CCN4CCN(C)CC4\" acquisition.type=EI" ] }, { "cell_type": "markdown", "id": "4869a3a8", "metadata": {}, "source": [ "Thanks to the capabilities of HYDRA, we can also easily set up a multirun of the GB-BI algorithm which sweeps throuhg the different combinations of the configurations specified in the command line. For instance, here we start a multirun with `ECFP4` and `FCFP4` as molecular representations of the surrogate model and using the `posterior mean` (Mean) and the `expected improvement` (EI) as acquisition functions." ] }, { "cell_type": "code", "execution_count": null, "id": "8c6a2fad-efdf-4a6e-bfad-542879f74365", "metadata": {}, "outputs": [], "source": [ "python GB-BI.py hydra.mode=MULTIRUN surrogate.representation=ECFP4,FCFP4 acquisition.type=Mean,EI" ] }, { "cell_type": "markdown", "id": "74d8d83b", "metadata": {}, "source": [ "## Components of the Configuration File ##\n", "\n", "As mentioned before, we will discuss each component of the configuration file. We will go through the different options for each of these components one by one and discuss potential pitfalls. The different components of the configuration file correspond to diffferent classes in the code." ] }, { "cell_type": "markdown", "id": "53282940-cfe6-4e5c-9515-06238bd02350", "metadata": {}, "source": [ "**Controller:** This class controls the basic overall settings of a GB-BI run. In the configuration file, you can set the maximum amount of generations and the maximum amount of fitness calls alloted to the optimsation process. The GB-BI run will end when either has been reached. " ] }, { "cell_type": "code", "execution_count": 13, "id": "0826fcff", "metadata": {}, "outputs": [], "source": [ "controller:\n", " max_generations: 100\n", " max_fitness_calls: 1000" ] }, { "cell_type": "markdown", "id": "6c4ec758-23d4-4ee6-9644-8773a2ba9d4d", "metadata": {}, "source": [ "**Archive:** This class controls the quality-diversity archive at the core of GB-BI. The amount of niches is set in this part of the configuration file, but also two more technical aspects. The archive makes use of a centroidal Voronoi tessellation which needs to be sampled at the beginning of each run (with a certain accuracy). To avoid unncessary re-sampling the centroidal Voronoi tessellation can be stored in a file, the name of which is set here. note that the name should be changed if you change the accuracy, dimensionality, or size of the archive. The dimensionality is set in the next part of the configuration file." ] }, { "cell_type": "code", "execution_count": null, "id": "868f10a6-1a99-4934-a1d7-1e0e75b329b5", "metadata": {}, "outputs": [], "source": [ "archive:\n", " name: archive_150_4_25000\n", " size: 150\n", " accuracy: 25000" ] }, { "cell_type": "markdown", "id": "ec52e35a-ab5b-44be-95fd-d8f7ab71c966", "metadata": {}, "source": [ "**Descriptor:** This class controls the amount and type of physicochemical properties that are used to describe a molecule and place it in its niche. The amounts of descriptors sets the dimensionality of the archive. When choosing ranges, be careful to make sure your target molecule lies within part of chemical space covered by the archive." ] }, { "cell_type": "code", "execution_count": null, "id": "71068fdf-6b80-4959-ba7b-8a956c99fc1a", "metadata": {}, "outputs": [], "source": [ "descriptor:\n", " properties:\n", " - Descriptors.ExactMolWt\n", " - Descriptors.MolLogP\n", " - Descriptors.TPSA\n", " - Crippen.MolMR\n", " ranges:\n", " - - 225\n", " - 555\n", " - - -0.5\n", " - 5.5\n", " - - 0\n", " - 140\n", " - - 40\n", " - 130" ] }, { "cell_type": "markdown", "id": "4c89e9a9-cf60-46c2-a3fb-2644567f9127", "metadata": {}, "source": [ "**Fitness:** This class controls the type of fitness function that is used during the GB-BI run. For fingerprint-based rediscovery, the options for the representation are `ECFP4`, `ECFP6`, `FCFP4`, and `FCFP6` in line with the representations supported by the Gaucamol rediscovery benchmarks. " ] }, { "cell_type": "code", "execution_count": null, "id": "e4249ca9-bad4-4e9f-8e66-729854b88c7e", "metadata": {}, "outputs": [], "source": [ "fitness:\n", " type: Fingerprint\n", " target: \"O=C1NC(=O)SC1Cc4ccc(OCC3(Oc2c(c(c(O)c(c2CC3)C)C)C)C)cc4\"\n", " representation: ECFP4" ] }, { "attachments": {}, "cell_type": "markdown", "id": "46364584-2b5f-437f-9fb4-40a39e168199", "metadata": {}, "source": [ "**Arbiter:** This class controls the type of structural filters that are used during the GB-BI run. The options for the rules are `Glaxo`, `Dundee`, `BMS`, `PAINS`, `SureChEMBL`, `MLSMR`, `Inpharmatica`, and `LINT`. The SMARTS in these rules are determined by Pat Walters, and more information on them can be found in the original scripts. When choosing rules, be careful to make sure your target molecule adheres to the chosen structural filters." ] }, { "cell_type": "code", "execution_count": null, "id": "48091e2f-efa5-4b2e-ac17-c97930ee3505", "metadata": {}, "outputs": [], "source": [ "arbiter:\n", " rules:\n", " - Glaxo" ] }, { "cell_type": "markdown", "id": "861b8f28-8b53-4124-8c5c-0a69c6264a65", "metadata": {}, "source": [ "Note that structural filters can be combined by adding them to configuration file. The combined options for the rules will be read in as a list." ] }, { "cell_type": "code", "execution_count": null, "id": "488c8d7c-7773-4bc5-a60a-7d1d4ede9a7b", "metadata": {}, "outputs": [], "source": [ "arbiter:\n", " rules:\n", " - Glaxo\n", " - Dundee\n", " - BMS" ] }, { "cell_type": "markdown", "id": "5862957c-9b8f-4424-89d0-ee0d5ad1abe2", "metadata": {}, "source": [ "**Generator:** This class controls the crossovers and mutations in the GB-BI run. The initial size, which is randomly sampled from the inital data (i.e. a column of smiles), can be set in this part of the configuration file together with the batch size for the rest of the optimisation run. The path to the file with the mutation data (SMARTS) can also be changed here. " ] }, { "cell_type": "code", "execution_count": null, "id": "5506e63d-1c5e-4838-8889-cddc900d01a1", "metadata": {}, "outputs": [], "source": [ "generator:\n", " batch_size: 40\n", " initial_size: 40\n", " mutation_data: data/smarts/mutation_collection.tsv\n", " initial_data: data/smiles/guacamol_intitial_rediscovery_troglitazone.smi" ] }, { "cell_type": "markdown", "id": "bb69bb28-660d-4386-bf7e-f127b7592cf9", "metadata": {}, "source": [ "**Surrogate:** This class controls the molecular representation used in surrogate fitness model of the GB-BI run. There are two available types: `Fingerprint` and `String`. For `Fingerprint`, the representation options are `ECFP4`, `ECFP6`, `FCFP4`, `FCFP6`, `RDFP`, `APFP`, and`TTFP`. " ] }, { "cell_type": "code", "execution_count": null, "id": "e3c83805-64a3-4dc1-b03a-48ae3a6ef403", "metadata": {}, "outputs": [], "source": [ "surrogate:\n", " type: Fingerprint\n", " representation: ECFP4" ] }, { "cell_type": "markdown", "id": "df9f28c2-c8f9-42b4-8bcc-4b17c714ada2", "metadata": {}, "source": [ "For `String`, the representation options are `Smiles` and `Selfies`. As these strings will be represented as a bag-of-characters in the Gaussian process, a maximum ngram `max_ngram` also needs to be defined in the configuration file. " ] }, { "cell_type": "code", "execution_count": null, "id": "95795fce", "metadata": {}, "outputs": [], "source": [ "surrogate:\n", " type: String\n", " representation: Selfies\n", " max_ngram: 5" ] }, { "cell_type": "markdown", "id": "06be3918-ec5c-47cb-a5f7-aa6445913c17", "metadata": {}, "source": [ "**Acquisition:** This class controls the type of acquisition function used in the GB-BI run. The options for the type are the posterioir mean `Mean`, the upper confidence bound `UCB`, the expected improvement `EI`, and a numerically stable implementation of the logarithm of the expected improvement `logEI`. " ] }, { "cell_type": "code", "execution_count": null, "id": "06f43da4-f906-402e-94a6-162ea37c8dcd", "metadata": {}, "outputs": [], "source": [ "acquisition:\n", " type: Mean" ] }, { "cell_type": "markdown", "id": "c9a36232-1a2a-470f-b8bb-e00fce1023e7", "metadata": {}, "source": [ "For the the upper confidence bound `UCB`, an additional hyperparameter `beta` which balances exploitation and exploration needs to be defined as well." ] }, { "cell_type": "code", "execution_count": null, "id": "48ca7b4d-22f3-4c45-abd2-134856366b78", "metadata": {}, "outputs": [], "source": [ "acquisition:\n", " type: UCB\n", " beta: 0.2" ] }, { "cell_type": "markdown", "id": "2bef39f2-4477-4a05-92e7-cfce61df6953", "metadata": {}, "source": [ "## References\n", "\n", "* Griffiths, Ryan-Rhys, et al. \"Gauche: A library for Gaussian processes in chemistry.\" Advances in Neural Information Processing Systems 36 (2024).\n", " \n", "* Verhellen, Jonas, and Jeriek Van den Abeele. \"Illuminating elite patches of chemical space.\" Chemical science 11.42 (2020): 11485-11491.\n", " \n", "* Brown, Nathan, et al. \"GuacaMol: benchmarking models for de novo molecular design.\" Journal of chemical information and modeling 59.3 (2019): 1096-1108.\n", "\n", "* Jensen, Jan H. \"A graph-based genetic algorithm and generative model/Monte Carlo tree search for the exploration of chemical space.\" Chemical science 10.12 (2019): 3567-3572." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }